3.3 Q4 Term Frequency-Inverse Document Frequency (TF-IDF)
To compute TF-IDF, we ran a Spark NLP processing job with the HashingTF feature transformer, chosen because of the large dataset size. The challenge with Spark's HashingTF is that it hashes each word into a fixed-size feature vector. This hashing makes the transformer efficient, but it discards the direct mapping between words and their vector indices, which makes it difficult to recover the original words from the indices afterward.
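The hashing trick described above can be illustrated with a small, self-contained sketch. This is not Spark code: CRC32 stands in for Spark's MurmurHash3, `NUM_FEATURES` is deliberately tiny (Spark's HashingTF defaults to 2^18), and the smoothed IDF formula mirrors the common log((N+1)/(df+1)) form. The point it demonstrates is the one in the text: the vector index is a one-way hash of the word, so the index alone cannot be mapped back to the term.

```python
import math
import zlib

NUM_FEATURES = 16  # tiny for illustration; Spark's HashingTF defaults to 2**18


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to a fixed-size term-frequency vector via hashing.

    The word -> index mapping is one-way: given only an index, the
    original word cannot be recovered (the challenge noted in the text).
    """
    vec = [0.0] * num_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec


def idf(docs_tf, num_features=NUM_FEATURES):
    """Smoothed inverse document frequency per feature index:
    log((N + 1) / (df + 1)), where df counts docs with a nonzero hit."""
    n = len(docs_tf)
    df = [sum(1 for v in docs_tf if v[i] > 0) for i in range(num_features)]
    return [math.log((n + 1) / (d + 1)) for d in df]


# Toy corpus using terms from the table below.
docs = [["blockchain", "burning"], ["blockchain", "career"], ["buffet"]]
tfs = [hashing_tf(d) for d in docs]
weights = idf(tfs)
tfidf = [[tf_i * w for tf_i, w in zip(vec, weights)] for vec in tfs]
```

In practice, one workaround for the lost mapping is to hash the known vocabulary forward and build an index-to-words lookup table (collisions permitting), or to use a vocabulary-preserving transformer such as Spark's CountVectorizer instead of HashingTF.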
Table 2: Top 10 Words by TF-IDF Scoring: Highlighting Unique Vocabulary
| Term | TF-IDF Score |
|------|--------------|
| blockchain | 2159.020063 |
| burning | 1217.472828 |
| adventures | 988.120861 |
| above | 969.387716 |
| buffet | 968.732854 |
| are | 927.111189 |
| ceoofdogecoin | 870.622154 |
| 240k | 827.099049 |
| announces | 817.186108 |
| career | 780.541266 |
(TF-IDF scores computed from a sample of the dataset)